Our project consist of exploratory data analysis (EDA) and predictive analytics on 2016 Global Ecological Footprint dataset. We choose this dataset since ecological footprint is very important measure which can tell us about magnitude of our impact on our planet. Learning from ecological data can be very useful since it can point to variables which are hurting nature the most.
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(corrplot)
library(psych)
library(leaflet)
library(naniar)
library(knitr)
library(kableExtra)
library(plotly)
library(cowplot)
library(caret)
library(randomForest)
library(rpart.plot)
library(rpart)
library(rgl)
dat <- read.csv("data/countries.csv", encoding="UTF-8", stringsAsFactors = F)
glimpse(dat)
## Observations: 188
## Variables: 21
## $ Country <chr> "Afghanistan", "Albania", "Alge...
## $ Region <chr> "Middle East/Central Asia", "No...
## $ Population..millions. <dbl> 29.82, 3.16, 38.48, 20.82, 0.09...
## $ HDI <dbl> 0.46, 0.73, 0.73, 0.52, 0.78, 0...
## $ GDP.per.Capita <chr> "$614.66", "$4,534.37", "$5,430...
## $ Cropland.Footprint <dbl> 0.30, 0.78, 0.60, 0.33, NA, 0.7...
## $ Grazing.Footprint <dbl> 0.20, 0.22, 0.16, 0.15, NA, 0.7...
## $ Forest.Footprint <dbl> 0.08, 0.25, 0.17, 0.12, NA, 0.2...
## $ Carbon.Footprint <dbl> 0.18, 0.87, 1.14, 0.20, NA, 1.0...
## $ Fish.Footprint <dbl> 0.00, 0.02, 0.01, 0.09, NA, 0.1...
## $ Total.Ecological.Footprint <dbl> 0.79, 2.21, 2.12, 0.93, 5.38, 3...
## $ Cropland <dbl> 0.24, 0.55, 0.24, 0.20, NA, 2.6...
## $ Grazing.Land <dbl> 0.20, 0.21, 0.27, 1.42, NA, 1.8...
## $ Forest.Land <dbl> 0.02, 0.29, 0.03, 0.64, NA, 0.6...
## $ Fishing.Water <dbl> 0.00, 0.07, 0.01, 0.26, NA, 1.6...
## $ Urban.Land <dbl> 0.04, 0.06, 0.03, 0.04, NA, 0.1...
## $ Total.Biocapacity <dbl> 0.50, 1.18, 0.59, 2.55, 0.94, 6...
## $ Biocapacity.Deficit.or.Reserve <dbl> -0.30, -1.03, -1.53, 1.61, -4.4...
## $ Earths.Required <dbl> 0.46, 1.27, 1.22, 0.54, 3.11, 1...
## $ Countries.Required <dbl> 1.60, 1.87, 3.61, 0.37, 5.70, 0...
## $ Data.Quality <chr> "6", "6", "5", "6", "2", "6", "...
dat$GDP.per.Capita <- as.numeric(gsub('[$,]', '', dat$GDP.per.Capita))
# conversion to factors
dat$Country <- as.factor(dat$Country)
dat$Region <- as.factor(dat$Region)
We can see that 19 of the columns are numeric types. Column GDP.per.Capita had to be converted to double type. Country and Region were converted to factors. We will ommit Data.Quality because we will not use it anywhere in this project.
dat$Data.Quality <- NULL
It’s always important to check for missing values and consider how to fix them.
missing_data <- dat %>% summarise_all(funs(sum(is.na(.))/n()))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
missing_data <- gather(missing_data, key = "variables", value = "percent_missing")
missing_data <- missing_data[missing_data$percent_missing > 0.0, ]
ggplot(missing_data, aes(x = reorder(variables, percent_missing), y = percent_missing)) +
geom_bar(stat = "identity", fill = "red", aes(color = I('white')), size = 0.3, alpha = 0.8)+
xlab('variables')+
coord_flip()+
theme_fivethirtyeight() +
ggtitle("Missing Data By Columns",
subtitle = "HDI has the highest percentage")
table1_dat <- dat[is.na(dat$HDI), c(1,2)]
rownames(table1_dat) <- NULL
table1_dat %>% kable(caption = "Countries with missing data") %>% kable_styling("striped", full_width = T) %>% row_spec(c(4,15), bold = T, background = "lightblue")
| Country | Region |
|---|---|
| Aruba | Latin America |
| Bermuda | North America |
| British Virgin Islands | Latin America |
| Cayman Islands | Latin America |
| Côte d’Ivoire | Africa |
| French Guiana | Latin America |
| French Polynesia | Asia-Pacific |
| Guadeloupe | Latin America |
| Korea, Democratic People’s Republic of | Asia-Pacific |
| Martinique | Latin America |
| Montserrat | Latin America |
| Nauru | Asia-Pacific |
| New Caledonia | Asia-Pacific |
| Réunion | Africa |
| Somalia | Africa |
| Wallis and Futuna Islands | Asia-Pacific |
They are all pretty much small countries (with exceptions like Côte d’Ivoire, Somalia etc.) .
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.608 42.16 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.519 40.71 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.593 41.91 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.662 43.03 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.736 44.21 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.706 43.74 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.644 42.73 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.641 42.69 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.61 42.18 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.582 41.74 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.675 43.24 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.628 42.48 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.664 43.06 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.712 43.84 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.724 44.03 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.655 42.92 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.633 42.56 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.762 44.65 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.765 44.70 |
## | Out-of-bag |
## Tree | MSE %Var(y) |
## 300 | 2.735 44.21 |
It’s useful to show histograms of all numeric columns.
multi.hist(dat[,sapply(dat, is.numeric)])
Most of the variables (Population..millions, Fishing.Water, etc.) have right skewed distributions. We will inspect Total Ecological Footprint more detaily since it is our target variable.
dat %>% ggplot(aes(x = Total.Ecological.Footprint)) +
geom_histogram(bins = 30, aes(y = ..density..), fill = "purple") +
geom_density(alpha = 0.2, fill = "purple") +
theme_fivethirtyeight() +
ggtitle("Total per Capita Footprint") +
theme(axis.title = element_text(), axis.title.x = element_text()) +
geom_vline(xintercept = mean(dat$Total.Ecological.Footprint), size = 2, linetype = 3) +
annotate("text", x = 7, y = 0.35, label = "Average per Capita Footprint is 3.32")
dat %>% arrange(desc(Total.Ecological.Footprint)) %>% select(Total.Ecological.Footprint, Country) %>% head(n = 15) %>% dplyr::rename(Footprint_per_Person = Total.Ecological.Footprint) %>%
kable(caption = "Biggest polluters - by countries", col.names = c("Footprint per Person",
"Country")) %>%
kable_styling("striped", full_width = F) %>%
row_spec(1:5, bold = T, color = "white", background = "#D7261E")
| Footprint per Person | Country |
|---|---|
| 15.82 | Luxembourg |
| 11.88 | Aruba |
| 10.80 | Qatar |
| 9.31 | Australia |
| 8.22 | United States of America |
| 8.17 | Canada |
| 8.13 | Kuwait |
| 7.97 | Singapore |
| 7.93 | United Arab Emirates |
| 7.92 | Trinidad and Tobago |
| 7.78 | Montserrat |
| 7.52 | Oman |
| 7.49 | Bahrain |
| 7.44 | Belgium |
| 7.25 | Sweden |
dat %>% arrange((Total.Ecological.Footprint)) %>% select(Total.Ecological.Footprint, Country) %>% head(n = 15) %>% dplyr::rename(Footprint_per_Person = Total.Ecological.Footprint) %>%
kable(caption = "Smallest polluters - by countries", col.names = c("Footprint per Person",
"Country")) %>%
kable_styling("striped", full_width = F) %>%
row_spec(1:5, bold = T, color = "white", background = "#56AB6F")
| Footprint per Person | Country |
|---|---|
| 0.42 | Eritrea |
| 0.48 | Timor-Leste |
| 0.61 | Haiti |
| 0.72 | Bangladesh |
| 0.79 | Afghanistan |
| 0.79 | Pakistan |
| 0.80 | Burundi |
| 0.81 | Malawi |
| 0.82 | Congo, Democratic Republic of |
| 0.87 | Mozambique |
| 0.87 | Rwanda |
| 0.91 | Tajikistan |
| 0.93 | Angola |
| 0.98 | Nepal |
| 0.99 | Madagascar |
k <- dat[, sapply(dat, is.numeric)]
k <- k[complete.cases(k), ]
korelacija <- cor(k)
corrplot(korelacija, method = "color", tl.cex = 0.825, title = "Pearson correlation", mar=c(0,0,1,0))
k2 <- dat[, sapply(dat, is.numeric)]
k2 <- k2[complete.cases(k2), ]
korelacija2 <- cor(k2, method = "spearman")
corrplot(korelacija2, method = "color", tl.cex = 0.825, title = "Spearman correlation", mar = c(0,0,1,0))
We can see that the results are different, so we can conclude that the linear relationship is probably not the best guess.
dat %>% group_by(Region) %>% tally() %>%
ggplot(aes(x = reorder(Region, n), n)) +
geom_bar(stat = "identity", fill = "lightgreen") +
theme_fivethirtyeight() +
ggtitle("Number of countries by regions") +
geom_text(aes(x = Region, y = 1, label = paste0(n)),
hjust=0.15, vjust=.5, size = 4, colour = 'black',
fontface = 'bold') +
coord_flip()
## Warning: Ignoring unknown parameters: binwidth, bins, pad
We will use linear regression to explain per capit footprint by serveral predictors. Both economical and geographical features will be used.
linear_model1 <- lm(Total.Ecological.Footprint ~ HDI, data = dat)
summary(linear_model1)
##
## Call:
## lm(formula = Total.Ecological.Footprint ~ HDI, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6459 -0.9784 -0.3301 0.6633 10.3107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.3021 0.5423 -7.933 2.74e-13 ***
## HDI 11.0241 0.7706 14.306 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.572 on 170 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.5462, Adjusted R-squared: 0.5436
## F-statistic: 204.7 on 1 and 170 DF, p-value: < 2.2e-16
linear_model2 <- lm(Total.Ecological.Footprint ~ I(exp(HDI)), data = dat)
summary(linear_model2)
##
## Call:
## lm(formula = Total.Ecological.Footprint ~ I(exp(HDI)), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5625 -0.9288 -0.4152 0.5897 10.0971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.3649 0.7785 -10.74 <2e-16 ***
## I(exp(HDI)) 5.7853 0.3829 15.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 170 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.5731, Adjusted R-squared: 0.5706
## F-statistic: 228.3 on 1 and 170 DF, p-value: < 2.2e-16
linear_model3 <- lm(Total.Ecological.Footprint ~ HDI + I(HDI**2), data = dat)
summary(linear_model3)
##
## Call:
## lm(formula = Total.Ecological.Footprint ~ HDI + I(HDI^2), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5236 -0.8254 -0.2381 0.3253 9.5480
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.502 1.985 3.275 0.001280 **
## HDI -23.827 6.238 -3.820 0.000188 ***
## I(HDI^2) 26.481 4.709 5.623 7.61e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.447 on 169 degrees of freedom
## (16 observations deleted due to missingness)
## Multiple R-squared: 0.6178, Adjusted R-squared: 0.6132
## F-statistic: 136.6 on 2 and 169 DF, p-value: < 2.2e-16
slr <- ggplot(dat, aes(HDI, Total.Ecological.Footprint)) +
geom_point(aes(text = Country)) +
geom_smooth(method= "lm", color = "red", linetype = 1, se=F) +
geom_smooth(method= "lm", formula = (y ~ x + I(x**2)), color = "blue", linetype = 2, se=F) +
ggtitle("Simple Linear Regression",
subtitle = "Model With Quadratic Term does much better")
## Warning: Ignoring unknown aesthetics: text
ggplotly(slr, tooltip = "text")
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
## Warning: Removed 16 rows containing non-finite values (stat_smooth).
slr <- ggplot(dat, aes(GDP.per.Capita, Total.Ecological.Footprint)) +
geom_point(aes(text = Country)) +
geom_smooth(method= "lm", color = "red", linetype = 1, se=F) +
geom_smooth(method= "lm", formula = (y ~ x + I(x**2)), color = "blue", linetype = 2, se=F) +
ggtitle("Simple Linear Regression",
subtitle = "Model With Quadratic Term does much better")
## Warning: Ignoring unknown aesthetics: text
ggplotly(slr, tooltip="text")
## Warning: Removed 15 rows containing non-finite values (stat_smooth).
## Warning: Removed 15 rows containing non-finite values (stat_smooth).
ggplot(dat, aes(x = HDI, y = GDP.per.Capita)) +
geom_point() +
theme_fivethirtyeight() +
ggtitle("HDI vs. GDP per Capita") +
theme(axis.title = element_text(), axis.title.x = element_text())
## Warning: Removed 17 rows containing missing values (geom_point).
It should be obvious that HDI and GDP are not linearly correlated. Exponential function would be much more suitable.
multiple1 <- lm(Total.Ecological.Footprint ~ GDP.per.Capita + HDI + I(GDP.per.Capita**2) + I(HDI**2) , data = dat)
summary(multiple1)
##
## Call:
## lm(formula = Total.Ecological.Footprint ~ GDP.per.Capita + HDI +
## I(GDP.per.Capita^2) + I(HDI^2), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8412 -0.7107 -0.1608 0.5324 5.3628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.853e+00 2.263e+00 -1.261 0.20920
## GDP.per.Capita 8.304e-05 2.503e-05 3.318 0.00111 **
## HDI 1.100e+01 7.758e+00 1.419 0.15791
## I(GDP.per.Capita^2) -1.535e-10 2.077e-10 -0.739 0.46083
## I(HDI^2) -5.035e+00 6.603e+00 -0.762 0.44690
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.259 on 166 degrees of freedom
## (17 observations deleted due to missingness)
## Multiple R-squared: 0.715, Adjusted R-squared: 0.7081
## F-statistic: 104.1 on 4 and 166 DF, p-value: < 2.2e-16
ggplot(dat, aes(x = Population..millions., y = Total.Ecological.Footprint)) +
geom_point() +
theme_fivethirtyeight() +
theme(axis.title = element_text(), axis.title.x = element_text()) +
ggtitle("Using population as predictor",
subtitle = "not really useful as predictor") +
annotate("text", x = 800, y = 10,label = "Pearson correlation is -0.06", color = "red", size = 9)
crops <- ggplot(dat, aes(x = Cropland, y = Total.Ecological.Footprint)) +
geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#e6a526") + theme_fivethirtyeight() +
ggtitle("Crop Land") +
theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")
urban <- ggplot(dat, aes(x = Urban.Land, y = Total.Ecological.Footprint)) +
geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#FF7F50")+
theme_fivethirtyeight() +
ggtitle("Urban Land") +
theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")
forest <- ggplot(dat, aes(x = Forest.Land, y = Total.Ecological.Footprint)) +
geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#56AB6F")+
theme_fivethirtyeight() +
ggtitle("Forest Land") +
theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")
fishing <- ggplot(dat, aes(x = Fishing.Water, y = Total.Ecological.Footprint)) +
geom_point(aes(size = Total.Ecological.Footprint), alpha = 0.4, color = "#504FB1")+
theme_fivethirtyeight() +
ggtitle("Fishing Water") +
theme(axis.title = element_text(), axis.title.x = element_text(), legend.position="none")
cowplot::plot_grid(urban, forest, fishing, crops, ncol = 2, nrow = 2,
label_y = "Total Footprint")
## Warning: Removed 15 rows containing missing values (geom_point).
## Warning: Removed 15 rows containing missing values (geom_point).
## Warning: Removed 15 rows containing missing values (geom_point).
## Warning: Removed 15 rows containing missing values (geom_point).
It can be seen increasing trends in all variables, except for Forest Land.
HDI <- dat$HDI
GDP <- dat$GDP.per.Capita <- as.numeric(gsub('[$,]', '', dat$GDP.per.Capita))
final_multiple <- lm(Total.Ecological.Footprint ~ GDP.per.Capita + HDI + I(GDP.per.Capita**2) + I(HDI**2) , data = dat_train_imputed)
predicted <- predict(multiple1, newdata = dat_test)
knitr::include_graphics(c("visual/cube1.png", "visual/cube2.png"))
RMSE_mlr <- sqrt(mean((predicted[complete.cases(predicted)]- dat_test$Total.Ecological.Footprint[complete.cases(predicted)])**2))
Regression trees are simple regression tehnique which can be easily interpreted and visualised. Disadvantages are that they are instable and sensitive to very small change in data. It is crucial to avoid overfitting when using regression trees.
d_tree <- rpart(Total.Ecological.Footprint ~ Population..millions. + HDI + GDP.per.Capita + Cropland + Grazing.Land + Forest.Land + Fishing.Water + Urban.Land + Total.Biocapacity, data = dat_train_imputed)
rpart.plot(d_tree, main="Regression Tree", fallen.leaves=T, box.palette="GnBu")
predictions_d_tree <- predict(d_tree, newdata = dat_test)
RMSE_regression_tree <- sqrt(mean((predictions_d_tree[complete.cases(dat_test)] - dat_test$Total.Ecological.Footprint[complete.cases(dat_test)])**2))
We will use 10-fold cross validation to select optimal values for hyperparameters: mtry (randomly selected variables) and ntree (total number of trees in random forest)
cv.10.folds <- createMultiFolds(dat_train_imputed, k = 10, times = 5) # cross-validation 10 folds
ctrl.1 <- trainControl(method = "repeatedcv", number = 10, repeats = 3, index = cv.10.folds)
rf_model <- train(Total.Ecological.Footprint ~ Population..millions. + HDI + GDP.per.Capita + Cropland + Grazing.Land + Forest.Land + Fishing.Water + Urban.Land + Total.Biocapacity, data = dat_train_imputed, method = "rf", trControl = ctrl.1, tuneLength = 5, importance = T)
importance <- varImp(rf_model)
finaldf <- importance$importance
variable_names <- rownames(finaldf)
finaldf$Variables <- variable_names
We can visualize importance of used predictors.
ggplot(finaldf, aes(x = reorder(Variables, Overall), Overall)) +
geom_point(stat = "identity", size = 4, pch = 21, fill = "blue") +
geom_bar(stat = "identity", width = 0.075) +
xlab("Variables Names") +
coord_flip() +
theme_fivethirtyeight() +
ggtitle("Variables Ordered By Importance")
predictions_rf <- predict(rf_model, newdata = dat_test[complete.cases(dat_test), ])
y_real <- dat_test[complete.cases(dat_test), "Total.Ecological.Footprint"]
RMSE_rf <- sqrt(mean((predictions_rf - y_real)**2))
results <- tibble(
x = y_real,
y = predictions_rf,
error = abs(y_real - predictions_rf)
)
ggplot(results, aes(x, y)) +
geom_point(size = 3, shape = 1) +
geom_abline(slope = 1, intercept = 0, color = "blue", linetype = 2) +
theme_fivethirtyeight() +
ggtitle("Observed vs. Predicted") +
annotate("text", x = 5, y = 1.5, label = "RMSE = 1.17", color = "purple", size = 10)
We can see that random forest algorithm got better results than regression tree. That’s not big of a surprise because random forests are well known for robustness and good scores.
models <- tibble(
model_name = character(),
RMSE = numeric()
)
models <- add_row(models, model_name = "Multiple linear regression", RMSE = RMSE_mlr)
models <- add_row(models, model_name = "Regression tree", RMSE = RMSE_regression_tree)
models <- add_row(models, model_name = "Random forests", RMSE = RMSE_rf)
models %>% kable(caption = "RMSE of different models", col.names = c("Model name", "RMSE")) %>% kable_styling("striped", full_width = T)
| Model name | RMSE |
|---|---|
| Multiple linear regression | 0.8865465 |
| Regression tree | 1.1031298 |
| Random forests | 0.8785311 |
The End!